Due to the rapid growth in the number of urban vehicles, traffic congestion has become
more serious. Signalized intersections are used all over the world and are
still being installed in new construction. This paper proposes a self-adaptive
approach, called evolutionary reinforcement learning multi-agents system
(ERL-MA), which combines computational intelligence and machine
learning. The concept of this work is to build an intelligent agent capable of
developing advanced skills to manage the traffic light control system at any
type of junction, using two powerful tools: learning from experience and
assumption based on randomization. The
ERL-MA is an independent multi-agents system composed of two layers:
the modeling layer and the decision layer. The modeling layer uses the
intersection modeling using generalized fuzzy graph technique. The decision
layer uses two methods: the novel greedy genetic algorithm (NGGA) and
Q-learning. In the Q-learning method, a multi Q-tables strategy and a
new reward formula are proposed. The experiments rely
on a real case study with a simulation of a one-hour scenario in the Pasubio
area of Bologna, Italy. The obtained results show that the ERL-MA system
achieves competitive results compared to the urban traffic optimization by
integrated automation (UTOPIA) system on different metrics.

1. INTRODUCTION


Traffic congestion is one of the biggest problems facing large cities around the world. It
can arise for several reasons, such as peak periods, big events, or tourist destinations [1], and it
not only impacts people's travel but also limits the development of the urban economy. Traffic control is one
of the important tools to contain congestion, restrain traffic flow, and reduce emissions. Many researchers
have proposed adaptive traffic signal control (ATSC) to solve the traffic congestion problem using a variety
of optimization techniques, such as heuristics [2]-[4], evolutionary algorithms [5], [6], and self-organization
strategies [7].


Several researchers have studied this problem using different concepts. ATSC has the potential
to efficiently adjust signal timing in real time in response to travel demand, weather, or seasonal traffic
fluctuations. Systems using this technology have outperformed actuated control and pre-timed methods
[8]. The two most famous ATSC systems are the split cycle offset optimization technique (SCOOT) [9] and the Sydney
coordinated adaptive traffic system (SCATS). The optimization process of these systems is based on real-time
data and predictions to assign a signal timing plan. Other control systems, such as optimization policies for
adaptive control (OPAC) [10] and the real time hierarchical optimized distributed effective system (RHODES)
[11], use a similar principle: they respond to fluctuations in traffic over time by dynamically
adjusting the signal timing parameters. In addition, the urban traffic optimization by integrated automation
(UTOPIA) [12] is a hierarchical decentralized traffic signal control strategy used in many countries. It
focuses on optimizing the traffic flows and gives selective priority to public transport while taking into
consideration the travel times of private traffic [13].

Recently, a new generation of traffic control systems has started to implement machine learning
methods, where multiple optimization and prediction techniques have helped to create more
powerful systems able to solve complex traffic condition issues [14]. Reinforcement learning (RL)
is a third machine learning paradigm, alongside supervised and unsupervised learning, that is used for
self-learning traffic signal control in the stochastic road traffic environment [15], [16]. Several studies
have demonstrated that Q-learning, which is an RL method, is capable of solving the ATSC problem [15].

In this perspective, this paper presents a new approach called evolutionary reinforcement learning
multi-agents system (ERL-MA), which is an independent multi-agents system composed of two layers. The
first one is the modeling layer, which uses the information provided by the intersection modeling using
generalized fuzzy graph (IMGFG) [17] to understand the junction's constraints and complexity. The second one
is the decision layer, which is composed of two activities: activity per cycle and activity per phase. The
novel greedy genetic algorithm (NGGA) [18] is used in the activity per cycle to compose the optimal
sequence of phases, and the accumulated knowledge of the Q-learning is applied to generate flexible
traffic signal timing in the activity per phase.

A real-world scenario from the city of Bologna, built in the iTETRIS project (an integrated wireless
and traffic platform for real-time road traffic management solutions), was prepared and described by Bieker
et al. [19]. This real-world scenario is used as a case study, which allows a sound
evaluation and an appropriate demonstration of the proposed approach. The rest of this paper is organized as
follows. In the next section, the background of this study is presented. In section 3, the proposed approach
is detailed. The experimental results are shown in section 4, where a real case study is simulated. Finally,
a discussion and conclusion are given in section 5.


2. BACKGROUND 

This section covers the background of the material presented in the paper. The first step in
optimizing the signal timing for traffic congestion is finding a powerful model; this model should present the
junction's influencing parameters and display their correlation. The traffic congestion problem at a signalized
junction (or intersection) depends on traffic fluctuations, green timing, cycle length, and phasing. Therefore, an
efficient system should have a powerful model to present the influencing parameters of the junction.
Modeling traffic behavior has been an active research topic for years, traditionally addressed using a
variety of methods, such as queuing theory [20], the model of Dotoli and Fanti [21], the cell transmission
model [22], and intersection graphs [17], [23], [24]. Boudaakat et al. [17] introduced the IMGFG, which applies
a generalized fuzzy graph concept, with fuzzy vertices and fuzzy edges, to model an
isolated intersection in two stages: the steady elements in the first stage lead to a fuzzy graph with crisp
vertices, and the situational status in the second stage leads to a generalized fuzzy graph. In this work, the
IMGFG is adopted as the modeling technique because it has the potential to present precisely the congestion
points and the level of conflicts.

Reinforcement learning has proved to be an effective method for developing adaptive traffic
signal controllers that adjust signal timing in response to traffic fluctuations [25], [26]. The objective of
Q-learning, which is a model-free reinforcement learning algorithm, is to gain experience in order to
make better decisions. However, despite all the advantages of the existing ATSC, the safety impact of
applying these methods is still unclear. Some studies showed that implementing ATSC algorithms
lowers traffic safety and significantly increases traffic conflicts [27], while others reported only a minor
reduction in traffic collisions [28]. Mobility optimization does not necessarily assure safety optimization, so
neglecting traffic safety as a main objective in the existing ATSC is probably responsible for the inconsistency
in the safety impact [29]. Therefore, the proposed ATSC is a new approach combining computational
intelligence and reinforcement learning; it adjusts signal timing in real time and ensures appropriate phasing
with maximum traffic safety.

3. ERL-MA APPROACH

The proposed approach is an independent multi-agents system, where each agent supervises a traffic
light system (covering a single junction or several junctions). The agent is designed to solve the traffic light
control problem in a way that fits all junction formats (with multi-lane or mono-lane edges,
directed and undirected lanes) and adapts traffic lights to the instantaneous flow changes. As depicted in
Figure 1, the agent in the proposed approach consists of two main layers. The IMGFG modeling layer is
responsible for modeling the junction, transforming it into an adjacency matrix, and providing the necessary
parameters. The second layer is the decision layer, divided into two activities: activity per cycle and activity per
phase. In the activity per cycle, the NGGA is applied to construct homogeneous clusters with the maximum
traffic safety and fluidity depending on the congestion level and the traffic demand density. In the activity per
phase, the Q-learning algorithm is employed to assign the adequate signal timing. The agent takes into
account the changes that occur at the intersection in every cycle and phase to manage the intersection.


Figure 1. The ERL-MA system


3.1. The modeling layer 

The modeling layer uses the IMGFG technique [17], where a generalized fuzzy graph coloring
approach is used to model any signalized intersection. All possible conflicts between outgoing lanes and the
congestion level in the lanes at the intersection are presented in this generalized fuzzy graph, called the situational
graph. In this generalized graph, the vertices represent lanes, the vertex weights are the levels of congestion,
the edges express the existing conflicts between those lanes, and the edge weights are the kinds of these
conflicts. The adjacency matrix represents the situational graph that will be sent to the next layer. The decision
layer is divided into two activities: activity per cycle and activity per phase.
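
To make the data structure passed to the decision layer more concrete, the following Python sketch builds an adjacency matrix for a small situational graph. It is only an illustration: the lane names, the congestion levels, the conflict weights, and the choice of storing vertex weights on the diagonal are assumptions made for this example, and the actual IMGFG encoding is the one defined in [17].

```python
# Hypothetical four-lane junction; names and values are illustrative only.
lanes = ["N_in", "S_in", "E_in", "W_in"]

# Vertex weights: assumed congestion level of each lane, in [0, 1].
congestion = {"N_in": 0.8, "S_in": 0.3, "E_in": 0.5, "W_in": 0.1}

# Edge weights: assumed conflict degree between two lanes, in (0, 1].
# Pairs that are absent do not conflict.
conflicts = {
    ("N_in", "E_in"): 1.0,
    ("N_in", "W_in"): 1.0,
    ("S_in", "E_in"): 1.0,
    ("S_in", "W_in"): 1.0,
    ("N_in", "S_in"): 0.4,
}


def build_adjacency(lanes, congestion, conflicts):
    """Return a symmetric matrix: diagonal = vertex weight, off-diagonal = edge weight."""
    index = {lane: i for i, lane in enumerate(lanes)}
    n = len(lanes)
    matrix = [[0.0] * n for _ in range(n)]
    for lane, level in congestion.items():
        matrix[index[lane]][index[lane]] = level
    for (a, b), weight in conflicts.items():
        matrix[index[a]][index[b]] = weight
        matrix[index[b]][index[a]] = weight
    return matrix


adjacency = build_adjacency(lanes, congestion, conflicts)
```

In this toy matrix, a zero off the diagonal means that the two lanes can receive green at the same time, which is the information the activity per cycle needs to form phases.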


3.2. The decision layer 
3.2.1. The activity per cycle 

In this activity, the NGGA combines a genetic algorithm and a greedy algorithm. The genetic algorithm is
selected as the framework that explores the search space, because the rate at which it can explore a
space of possibilities is satisfactory and well suited to combinatorial problems. The greedy algorithm is added as
a function in the genetic algorithm to accelerate the search and improve the fitness of the chromosomes [18]. The
NGGA has been implemented on fuzzy graphs and has proved its efficiency on this kind of problem [30]. In
the proposed approach, the algorithm takes as input the adjacency matrix generated from the IMGFG and
provides as output an optimal combination of lanes (k clusters) that ensures a high flux with the
maximum safety. These clusters represent the phases that the agent will apply at the intersection during this cycle.
After that, the agent sorts these phases according to the average queue length and the waiting time of each
phase to produce the phase sequence for the next step.
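
The kind of output produced by the activity per cycle can be sketched as follows. This is not the NGGA of [18]: a plain greedy pass is swapped in to group lanes into conflict-free clusters, only to show the shape of the result (k phases) and the subsequent sorting step. The lane names, the data, and the assumption that phases with larger average queues and waiting times are served first are hypothetical.

```python
def greedy_phases(lanes, conflict, order_key):
    """Group lanes into conflict-free phases with a plain greedy pass.

    This is NOT the NGGA; it only illustrates the kind of output
    (k clusters of mutually compatible lanes) of the activity per cycle.
    """
    phases = []
    for lane in sorted(lanes, key=order_key, reverse=True):
        for phase in phases:
            # A lane joins the first phase in which it conflicts with nobody.
            if all(frozenset((lane, other)) not in conflict for other in phase):
                phase.append(lane)
                break
        else:
            phases.append([lane])
    return phases


def sort_phases(phases, queue_len, waiting):
    """Order phases by average queue length, then average waiting time (largest first)."""
    def score(phase):
        avg_queue = sum(queue_len[lane] for lane in phase) / len(phase)
        avg_wait = sum(waiting[lane] for lane in phase) / len(phase)
        return (avg_queue, avg_wait)
    return sorted(phases, key=score, reverse=True)


# Hypothetical data for a four-lane junction.
lanes = ["N", "S", "E", "W"]
conflict = {frozenset(p) for p in [("N", "E"), ("N", "W"), ("S", "E"), ("S", "W")]}
queue_len = {"N": 3, "S": 5, "E": 12, "W": 8}            # vehicles
waiting = {"N": 20.0, "S": 35.0, "E": 90.0, "W": 60.0}   # seconds

phases = greedy_phases(lanes, conflict, order_key=lambda lane: queue_len[lane])
sequence = sort_phases(phases, queue_len, waiting)
```

With this data the east-west lanes form one phase and the north-south lanes another, and the east-west phase comes first in the sequence because its queues are longer.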


3.2.2. The activity per phase 

In this activity, the agent uses the Q-learning algorithm to decide the green time for every
upcoming phase in the sequence. The Q-learning strategy adopted in this work uses a set of Q-tables, where every
Q-table corresponds to a cycle format. A Q-table is a matrix [31], where each row represents a specific state
(intersection situation) and each column represents a specific action (green time). When the agent decides to
apply a cycle with k phases, the Q-table_k is called to give the adequate green time (action) for the
associated phasing sequence (state). The agent compares all the Q-values of the requested state, and the action
with the highest Q-value is chosen. The characteristics of this activity are:
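
The multi Q-tables lookup described above can be pictured with the short Python sketch below. The state labels, the Q-values, and the candidate green times are made-up placeholders; only the mechanism (select the table by the number of phases k, then take the action with the highest Q-value) follows the text.

```python
# One Q-table per cycle format k (number of phases); all values are made up.
# Each table maps a state label to a list of Q-values, one per candidate green time.
green_times = [10, 20, 30]  # seconds, hypothetical action set

q_tables = {
    2: {"s0": [0.1, 0.7, 0.4]},
    3: {"s0": [0.5, 0.2, 0.9]},
}


def choose_green_time(q_tables, k, state, actions):
    """Pick the action with the highest Q-value in the table for k-phase cycles."""
    q_values = q_tables[k][state]
    best = max(range(len(actions)), key=lambda i: q_values[i])
    return actions[best]


green_for_k3 = choose_green_time(q_tables, 3, "s0", green_times)
green_for_k2 = choose_green_time(q_tables, 2, "s0", green_times)
```

The same state label can therefore lead to different green times depending on the cycle format, since each value of k has its own table.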

Search space: In the signalized traffic control problem, the space of states and actions can be very
large depending on the parameters used to represent them. The search space grows exponentially when the
states are numerous, which impacts the Q-learning knowledge space (Q-learning matrices): the bigger the search
space, the smaller the portion of the Q-matrix that is explored. In the proposed approach, the number of Q-tables
varies from one intersection to another, and the maximum and minimum numbers of phases are the first parameters to
define in order to know how many Q-tables exist. In order to reduce the number of states describing all
possible situations generated by the NGGA, states are represented in a normalized form where one state can
describe a number of similar situations in every Q-table. Also, the actions available in every Q-table are limited to
the maximum phasing time.

State definition: The general format of a state in the search space is [K, CP, AQW table], where K
represents the number of phases in the cycle, CP the current phase index, and AQW table the average
queue weight of every phase. The states in every Q-table_k are represented as [CP, AQW table],
where the average queue weight values are limited to Ø, low (L), medium (M), and high (H), and the size of the
average queue weight table depends on the number of phases used in the phase sequence; see Figure 2.


Figure 2. Illustration of a state in a cycle with four phases
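
As an illustration of the state format [CP, AQW table], the following Python sketch encodes a four-phase state and counts how many states a Q-table can hold at most. The numeric thresholds that separate L, M, and H, the [0, 1] scale of the queue weight, and the assumption that every combination of levels can occur are hypothetical choices made for this example.

```python
LEVELS = ("Ø", "L", "M", "H")


def level(avg_queue_weight, low=0.33, high=0.66):
    """Map an average queue weight in [0, 1] to Ø/L/M/H (thresholds are assumed)."""
    if avg_queue_weight == 0:
        return "Ø"
    if avg_queue_weight <= low:
        return "L"
    if avg_queue_weight <= high:
        return "M"
    return "H"


def encode_state(current_phase, avg_queue_weights):
    """Build the in-table state [CP, AQW table] as a hashable tuple."""
    return (current_phase, tuple(level(w) for w in avg_queue_weights))


def state_count(k):
    """Upper bound on the states of the Q-table for k-phase cycles: k x 4^k."""
    return k * len(LEVELS) ** k


state = encode_state(1, [0.0, 0.2, 0.5, 0.9])
states_for_four_phases = state_count(4)
```

Under these assumptions a four-phase cycle has at most 4 x 4^4 = 1,024 states, which shows why limiting the queue weight to four levels keeps each Q-table small.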


Action definition: An action represents the green time of a phase. For every state in the knowledge matrices
(Q-tables), different timings in the range [min phase time, max phase time] are tested in the learning stage. The
max green time per phase differs from one Q-table_k to another and depends on the max cycle time and the
number of phases (k) in the cycle. The best-rated actions are those with the maximum benefit for the current
phase and the least negative impact on the intersection.

Reward definition: Any action applied to a phase has a positive impact on the related lanes and a
negative impact on the other lanes. In order to find the action with the maximum benefit for the intersection, this
distinction is necessary to balance action and effect. The reward is calculated in (1) based on the effect of the
selected action on all lanes of the intersection: the degree of passed vehicles and the change in queue weight
are used as metrics for the lanes belonging to the current phase, and for the lanes belonging to
other phases, the average of the cumulative waiting time and the resulting queue weight are used.


$$
R_i = PV_i \cdot \bigl(1 + (QWB_i - QWA_i)\bigr) - \frac{\sum_{j \neq i}^{NbrPhases} QWA_j \cdot (WTA_j - WTB_j)}{(NbrPhases - 1) \cdot A_i} \tag{1}
$$

where $PV_i$ is the degree of passed vehicles at the current phase, $QWB_i$ is the average queue weight of the current
phase before the action, $QWA_i$ is the average queue weight of the current phase after the action, $QWA_j$ is the average
queue weight of phase $j$ after the action, $WTA_j$ is the cumulative waiting time of phase $j$ after the action,
$WTB_j$ is the cumulative waiting time of phase $j$ before the action, and $A_i$ is the current action.
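
A small numeric example may help to read the reward in (1). The Python sketch below follows one reading of the formula, in which the gain of the current phase is reduced by the average penalty of the other phases divided by the applied green time; the function name, the argument layout, and all input values are hypothetical.

```python
def reward(i, pv, qwb, qwa, wtb, wta, action):
    """Reward of applying green time `action` to phase i, under one reading of (1).

    pv      : degree of passed vehicles at the current phase
    qwb/qwa : average queue weight of every phase before/after the action
    wtb/wta : cumulative waiting time of every phase before/after the action
    Requires at least two phases and a strictly positive action.
    """
    n = len(qwa)
    gain = pv * (1 + (qwb[i] - qwa[i]))
    penalty = sum(qwa[j] * (wta[j] - wtb[j]) for j in range(n) if j != i)
    return gain - penalty / ((n - 1) * action)


# Hypothetical three-phase cycle; phase 0 receives 20 s of green.
r = reward(
    i=0,
    pv=0.8,
    qwb=[0.6, 0.3, 0.2],
    qwa=[0.2, 0.4, 0.3],
    wtb=[0.0, 10.0, 20.0],
    wta=[0.0, 30.0, 40.0],
    action=20,
)
```

Here the gain of the served phase is 0.8 x 1.4 = 1.12 and the averaged penalty of the two waiting phases is 14 / 40 = 0.35, so the reward is about 0.77.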

Q-values and coefficient: The Q-table in (2) defines the knowledge space, where every Q-table_k
presents a sub-space with a common format; k is the number of phases presenting the state.

$$
Q\text{-table}(k) =
\begin{cases}
Q\text{-table}_i, & k = i \\
Q\text{-table}_{i+1}, & k = i+1 \\
\quad \vdots \\
Q\text{-table}_j, & k = j
\end{cases}
\tag{2}
$$


The Q-learning uses the experience of each state transition to update one element of a Q-table_k. A Q-table_k is
a matrix in which each row represents a specific state and each column represents a specific action. Each cell
in this matrix represents a Q-value for a specific state-action pair $Q_k(s, a)$ [32]. The Q-value, in general, is
used to compare various actions at a specific state. The Q-learning algorithm improves its policy by updating
the Q-table_k in (3) according to Bellman's equation [33].

$$
Q_k^{t+1}(s^t, a^t) = Q_k^t(s^t, a^t) + \alpha^{t+1} \left[ r^{t+1} + \gamma \max_{a^{t+1}} Q_k^t(s^{t+1}, a^{t+1}) - Q_k^t(s^t, a^t) \right] \tag{3}
$$

where $s^t$, $a^t$ are the current state and the action selected at the current state; $Q^{t+1}$, $Q^t$ are the updated and the old
Q-values; $r^{t+1}$ is the reward of applying action $a^t$ at state $s^t$; $s^{t+1}$, $a^{t+1}$ are the new state and the best action at the
new state; $\alpha^{t+1}$ is the learning rate; and $\gamma$ is the discount rate.
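
The update rule in (3) is the standard tabular Q-learning update, and a single step can be sketched in Python as below. The dictionary-of-lists layout of the Q-table, the state labels, and the values chosen for the learning rate and the discount rate are assumptions made for illustration, not the settings used in the experiments.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step to q_table[state][action] and return the new value."""
    old = q_table[state][action]
    best_next = max(q_table[next_state])  # value of the best action at the new state
    q_table[state][action] = old + alpha * (reward + gamma * best_next - old)
    return q_table[state][action]


# Hypothetical Q-table with two states and two actions.
q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
new_value = q_update(q, "s0", 1, reward=0.5, next_state="s1")
```

With these numbers the cell moves from 0 toward the target 0.5 + 0.9 x 2 = 2.3 by a tenth of the gap, which gives about 0.23; only the visited state-action cell changes.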

Training stage: Before testing the ERL-MA system, a learning phase is mandatory. Every agent
undergoes a training stage to learn how to behave with the intersection and to be able to assign
the adequate action to the current situation, taking into consideration the structure of the intersection and the impact
of the action on it. The policy matrices are set to zeros; then, the agent enters the learning stage, where it is
exposed to randomly generated scenarios combining low, medium, and high flows in order to reach as many
state-action cells of the Q-tables as possible. In addition, the action selection is subject to a noise coefficient
to stimulate exploration. As a result, the agent becomes ready for the evaluation with optimal policy
matrices. The agent does not stop updating its policy matrices after this stage; it continues to
improve with future experience.
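
The role of the noise coefficient in the training stage can be illustrated with the Python sketch below. The paper does not specify how the noise is applied, so an epsilon-greedy rule is swapped in here as a common stand-in: with probability `noise` a random action is tried, otherwise the best known action is used. The state labels and the number of actions are hypothetical.

```python
import random


def init_q_table(states, n_actions):
    """Policy matrix initialised to zeros, as at the start of the training stage."""
    return {state: [0.0] * n_actions for state in states}


def select_action(q_row, noise, rng):
    """With probability `noise` explore a random action, otherwise exploit the best one."""
    if rng.random() < noise:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


rng = random.Random(0)
q_table = init_q_table(["s0", "s1"], n_actions=3)
q_table["s0"][2] = 1.0  # pretend action 2 has already been rewarded

greedy_choice = select_action(q_table["s0"], noise=0.0, rng=rng)
explored = {select_action(q_table["s0"], noise=1.0, rng=rng) for _ in range(200)}
```

Without noise the agent always repeats the best known action, whereas with noise it also visits the other columns of the table, which is what allows the zero-initialised cells to be filled during learning.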


4. EXPERIMENTAL RESULTS 
4.1. Simulation environment 

In this paper, simulation of urban mobility (SUMO) [34] is used as the simulation environment. It is 
a free and open-source microscopic traffic simulation. It has been available since 2001 and allows modeling 
of intermodal traffic systems including road vehicles, pedestrians and public transport; it is a tool widely used 
for traffic research. 


4.2. Testbed network 

Traffic light signal plans are rarely open to the public and are often not available in digital format.
To replicate a part of a real road network, gathering, converting, and adapting all the data is time-consuming.
Also, correction and validation by the municipality responsible for the studied area, which are
mandatory to allow real-world evaluations and fair comparisons, are rarely obtainable. Real-world scenarios from
Bologna built in the iTETRIS project (co-funded by the European Commission, with the
municipality of Bologna as a project partner), where all the previous conditions are respected, were prepared and
made available to the public within the SUMO package by Bieker et al. [19]. The proposed approach is applied to
the network of the Pasubio area in Bologna, Italy, as shown in Figure 3.

The Pasubio scenario extends the scenarios to the area around the hospital and also includes
common routes to the football stadium, as shown in Figure 4. Due to the situation and the traffic problems in
Bologna, the municipality of Bologna delivered a large set of data and simulation scenarios. The given data
included representations of the areas around the Pasubio roads, as input files for a commercial microscopic
traffic simulation. The scenarios modeled the peak hour in Bologna (8:00-9:00 am) [19].

The congestion level in Bologna in 2019 was 25%, ranking the city 205th in the world; by road type, the
congestion level was 17% on highways and 31% on non-highways. The real-world scenarios from Bologna were
prepared and are described under the iTETRIS project, which is an integrated wireless and traffic platform for
real-time road traffic management solutions intended to help evaluate road traffic engineering. The
Bologna traffic scenario was built to illustrate the traffic congestion in two areas, Pasubio and Andrea Costa.

Figure 3. Location of Bologna in Italy


Figure 4. Pasubio traffic network in SUMO


The Pasubio scenario of Bologna, Italy [19], presented in Figure 4, was chosen to test the
proposed approach. The selected area represents 2.45 km² of a real city with a total of eight traffic lights,
sixty-five nodes, and one hundred eleven edges. The scenario reproduces the real traffic during the morning peak hour
(8:00-9:00 am). The city of Bologna uses the UTOPIA system for traffic light control. UTOPIA [35]
optimizes traffic light schedules and sorts the traffic light phases to satisfy traffic demand.


4.3. Simulation 

The tested area is composed of eight traffic light control systems. Some of them cover several junctions
(intersections) with one traffic light control system. Every traffic light control system (TLC) is supervised
by an agent, as shown in Table 1. Every agent undergoes the training stage separately until it acquires the
necessary skills. As shown in Figure 5, junctions 4 and 14 are controlled by TLC 230, which is supervised by
agent 1. The accumulated knowledge and the scenarios presented during the learning stage for agent 1 are
shown in Figures 6 and 7. All the other agents of the network are exposed to the same scenarios.


Table 1. Junctions and controlling agents

Agent   TLC Id   Junction Ids
1       230      14, 4
2       231      9, 10, 12
3       232      29, 27
4       233      15
5       220      36
6       219      1, 32
7       282      18
8       218      0, m0


Figure 5. Illustration of junctions with the associated TLC


Agent 1 is exposed to randomly generated scenarios combining low, medium, and high flows in
order to reach as many state-action cells of the Q-tables as possible. Figure 6 shows the quantity of
new knowledge acquired at the end of every scenario. At the beginning of the learning phase, the quantity of
new knowledge is massive because the agent is uninformed. Over time, the new knowledge accumulated
per scenario decreases until it becomes insignificant; at this point the optimal policy is obtained and the agent
is ready for evaluation. Figure 7 shows the running time of all the random scenarios applied to
agent 1 in the learning stage.


Figure 6. The acquired knowledge variation in the learning phase (Q-values variation for Q-table[k=2], Q-table[k=3], and Q-table[k=4])


Figure 7. The running time of random generated scenarios in the learning phase


4.4. Approach evaluation 

The proposed approach is evaluated on a real-world scenario. The network area is 2.45 km² with eight
traffic lights, 8,776 loaded vehicles, and different types of signalized intersections; the proposed system is
compared to the UTOPIA system implemented at the Pasubio network. After the learning phase was applied to
all agents controlling the TLCs in the Pasubio area, the same scenario, provided as test data,
was run in 50 simulations to evaluate the performance of the proposed approach against the UTOPIA
system. Figure 8 shows that the proposed approach outperforms the UTOPIA system in 100% of the runs,
finishing in 6.60% to 12.44% less time. The UTOPIA system finishes the simulation in 4,605 sec, whereas the
running time of the proposed approach varies between 4,032 and 4,301 sec; this variability is due to
the NGGA, which is a stochastic algorithm. At first glance the randomness may be seen as a
weakness, but on closer inspection it should be seen as a powerful tool that distinguishes this
approach: it means that the agent is always able to benefit from new experiences and
develop new skills to manage the traffic light control system.
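
The reported improvement range can be checked directly from the simulation times quoted above. The short Python sketch below recomputes the relative time saving of the fastest and slowest ERL-MA runs with respect to UTOPIA; the function and variable names are chosen for this example only.

```python
utopia = 4605            # seconds, UTOPIA simulation time
erl_ma = (4032, 4301)    # seconds, fastest and slowest ERL-MA runs


def time_saving(baseline, value):
    """Relative time saving with respect to the baseline, in percent (two decimals)."""
    return round(100 * (baseline - value) / baseline, 2)


best = time_saving(utopia, erl_ma[0])
worst = time_saving(utopia, erl_ma[1])
```

The two values, (4,605 - 4,032) / 4,605 and (4,605 - 4,301) / 4,605, reproduce the 12.44% and 6.60% bounds of the interval.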

The waiting time graphs show that the average waiting time was well controlled by ERL-MA
compared to UTOPIA. Agent 1 at junction 4 and agent 8 at junction 0 had some
difficulty controlling the waiting time; however, they generally succeeded in handling it compared to
UTOPIA, as shown in Figure 9. Likewise, the simulation results presented in Figure 10 show that the ERL-MA
agents have improved the queue length at all junctions compared to the UTOPIA system.

Figure 9. Illustration of the average waiting time at the Pasubio junctions

Figure 10. Illustration of the max queue length at the Pasubio junctions


As a result of the improvements shown in the above graphs, CO2 and CO emissions have been significantly
reduced, owing to the shorter waiting time and the shorter accumulated queue length of vehicles. The lower
simulation time has also contributed to reducing CO2 and CO. The ERL-MA approach reduced the CO2 and CO
emissions produced by vehicles, as depicted in Figure 11.

The experimental results show that the ERL-MA system can generate traffic signal
timing that adapts to traffic variation for the Pasubio scenario and for different intersection forms. The
proposed approach was tested using a real-world scenario and compared with the system implemented at the
Pasubio network. The ERL-MA agent uses the information provided by the IMGFG to understand the
intersection constraints, the NGGA to compose the optimal phase sequence, and the accumulated
knowledge of the Q-learning to generate flexible traffic signal timing. The model can
certainly be extended to consider other aspects, such as communication between agents or between
lights and vehicles. In these experiments, the ERL-MA obtained competitive solutions to the traffic light control problem.

Figure 11. Illustration of CO and CO2 emissions of vehicles


5. CONCLUSION 

The proposed approach uses two powerful human abilities, learning and assumption,
to build an intelligent agent capable of managing the traffic light control system. Agents are designed to fit
the junction format and have the potential to adjust signal timing in response to instantaneous changes.
The proposed approach does not require expensive changes or demanding preconditions. Furthermore,
it does not impose changes on the infrastructure of the intersection or require direct communication with
vehicles. This approach was designed to fit and self-adapt to the intersection infrastructure and the TLC.
Computational intelligence, machine learning, and fuzzy graph modeling are used to build this
approach. The proposed ERL-MA system was evaluated with a real-world scenario and compared to the
existing system. The obtained results showed that the proposed approach achieved competitive
results on different metrics.